Customer clustering analysis¶
This notebook presents a comprehensive analysis of customer segmentation using clustering techniques. The goal of this analysis is to identify distinct groups of customers based on their behaviors, which can help in understanding the underlying patterns within the customer base and optimizing marketing strategies. By leveraging clustering methods, we aim to identify meaningful customer segments, and subsequently, we use classification techniques to explore how demographic variables can predict cluster membership.
The dataset used for this analysis, titled "Customer Personality Analysis," was sourced from Kaggle and provides a range of customer attributes, including demographic information, product preferences, and spending habits.
In this notebook, we will walk through the following steps:
- Data Import and Preprocessing: We start by importing the dataset and performing necessary preprocessing steps, including handling missing values, encoding categorical features, and scaling the data.
- Exploratory Data Analysis (EDA): A visual and statistical exploration of the data to understand the distribution and relationships between variables.
- Principal Component Analysis (PCA): Reducing the dimensionality of the feature space before clustering.
- Clustering Analysis: Application of clustering algorithms such as K-Means and Agglomerative Clustering to segment the customers into distinct groups based on spending habits.
- Model Evaluation: Assessment of the clustering results using metrics like silhouette score to determine the optimal number of clusters.
- Classification: After identifying clusters, we employ classification models to predict cluster membership based on demographic features such as age, income, and education level. This step aims to understand how demographic factors relate to the behavioral clusters identified earlier.
- Conclusion and Insights: Summary of the findings and insights derived from the clustering analysis, along with potential implications for marketing strategies.
This analysis aims to provide valuable insights into customer behavior, enabling businesses to tailor their marketing efforts and enhance customer satisfaction. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that particular segment.
Importing libraries¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, accuracy_score, ConfusionMatrixDisplay
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
Importing the Data¶
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d imakash3011/customer-personality-analysis
!unzip customer-personality-analysis.zip
!rm customer-personality-analysis.zip
data = pd.read_csv('marketing_campaign.csv', sep='\t')
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
EDA¶
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Z_CostContact 2240 non-null int64 27 Z_Revenue 2240 non-null int64 28 Response 2240 non-null int64 dtypes: float64(1), int64(25), object(3) memory usage: 507.6+ KB
There are a few missing values in the Income variable. We will simply drop those rows.
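Before dropping rows, it helps to count the missing values per column; a minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({'Income': [58138.0, np.nan, 71613.0, np.nan],
                    'Kidhome': [0, 1, 0, 1]})

# Count missing values per column, then drop the incomplete rows.
missing = toy.isna().sum()
print(missing['Income'])   # 2 rows lack Income
clean = toy.dropna()
print(len(clean))          # 2 complete rows remain
```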
data.dropna(inplace=True)
data.describe()
| ID | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | ... | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.000000 | 2216.0 | 2216.0 | 2216.000000 |
| mean | 5588.353339 | 1968.820397 | 52247.251354 | 0.441787 | 0.505415 | 49.012635 | 305.091606 | 26.356047 | 166.995939 | 37.637635 | ... | 5.319043 | 0.073556 | 0.074007 | 0.073105 | 0.064079 | 0.013538 | 0.009477 | 3.0 | 11.0 | 0.150271 |
| std | 3249.376275 | 11.985554 | 25173.076661 | 0.536896 | 0.544181 | 28.948352 | 337.327920 | 39.793917 | 224.283273 | 54.752082 | ... | 2.425359 | 0.261106 | 0.261842 | 0.260367 | 0.244950 | 0.115588 | 0.096907 | 0.0 | 0.0 | 0.357417 |
| min | 0.000000 | 1893.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 25% | 2814.750000 | 1959.000000 | 35303.000000 | 0.000000 | 0.000000 | 24.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | ... | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 50% | 5458.500000 | 1970.000000 | 51381.500000 | 0.000000 | 0.000000 | 49.000000 | 174.500000 | 8.000000 | 68.000000 | 12.000000 | ... | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 75% | 8421.750000 | 1977.000000 | 68522.000000 | 1.000000 | 1.000000 | 74.000000 | 505.000000 | 33.000000 | 232.250000 | 50.000000 | ... | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| max | 11191.000000 | 1996.000000 | 666666.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 20.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 3.0 | 11.0 | 1.000000 |
8 rows × 26 columns
Z_CostContact and Z_Revenue are constant, so we can remove them.
data = data.drop(['Z_CostContact', 'Z_Revenue'], axis=1)
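More generally, constant columns can be detected programmatically with `nunique`; a minimal sketch on toy data (the values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'Z_CostContact': [3, 3, 3],
                    'Z_Revenue': [11, 11, 11],
                    'Income': [58138.0, 46344.0, 71613.0]})

# A column with a single unique value carries no information for clustering.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
print(constant_cols)  # ['Z_CostContact', 'Z_Revenue']
toy = toy.drop(columns=constant_cols)
```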
Year_Birth and Income appear to have some outliers. We will drop them.
alt.Chart(data).mark_boxplot().encode(
    x=alt.X('Year_Birth:Q', scale=alt.Scale(domain=[1900, 2020])),
) & alt.Chart(data).mark_boxplot().encode(
    x=alt.X('Income:Q'),
)
data = data[data.Year_Birth >= 1920]
data = data[data.Income <= 500000]
Semantic groups of columns¶
We will group the columns in the dataset in a semantic manner, to apply meaningful preprocessing and analysis to the different parts of the data.
data.columns
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Response'],
dtype='object')
- ID: Customer's unique identifier
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
People (demographics)
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
Products
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
Promotion
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
Place
- NumDealsPurchases: Number of purchases made with a discount
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month
people = list(data.columns[1:7])
print('People: ', people)
product = [x for x in data.columns if x.startswith('Mnt')]
print('Product: ',product)
place = [x for x in data.columns if (x.startswith('Num') and x != 'NumWebVisitsMonth')]
print('Place: ', place)
promotion = [x for x in data.columns if x.startswith('Accepted')]+['Response']
print('Promotion: ', promotion)
People: ['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome'] Product: ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'] Place: ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'] Promotion: ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response']
Product preferences¶
This group of variables describes the spending behaviours of customers, based on their product preferences.
data[product].head()
| MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | |
|---|---|---|---|---|---|---|
| 0 | 635 | 88 | 546 | 172 | 88 | 88 |
| 1 | 11 | 1 | 6 | 2 | 1 | 6 |
| 2 | 426 | 49 | 127 | 111 | 21 | 42 |
| 3 | 11 | 4 | 20 | 10 | 3 | 5 |
| 4 | 173 | 43 | 118 | 46 | 27 | 15 |
We normalize the product values so that each row sums to 1, and we store the total amount spent across all products in a new 'product_sum' variable.
data['product_sum'] = data[product].sum(axis=1)
normalizer = Normalizer(norm='l1')
data[product] = normalizer.fit_transform(data[product])
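To illustrate what `Normalizer(norm='l1')` does, a minimal sketch on two hypothetical spending rows: each row is divided by its row sum, so absolute amounts become within-customer spending shares.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical customers' spending across six product categories.
rows = np.array([[635., 88., 546., 172., 88., 88.],
                 [ 11.,  1.,   6.,   2.,  1.,  6.]])

shares = Normalizer(norm='l1').fit_transform(rows)

# Every row now sums to 1: big and small spenders become comparable profiles.
print(shares.sum(axis=1))  # [1. 1.]
```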
product_titles=[]
product_charts=[]
for i, col in enumerate(product):
title = alt.Chart().mark_text(align='right', dx=5).encode(
text=alt.value(col)
)
chart = alt.Chart(data[product]).mark_boxplot().encode(
        x=alt.X(col, title=None, scale=alt.Scale(type='sqrt'))
)
product_charts.append(chart)
product_titles.append(title)
alt.hconcat(alt.vconcat(*product_titles),
alt.vconcat(*product_charts,
spacing=0
).resolve_scale(x='shared')
)
Wines are the best-selling products, followed by meat products.
alt.Chart(data).mark_point().encode(
alt.X(alt.repeat("column"), type='quantitative'),
alt.Y(alt.repeat("row"), type='quantitative')
).properties(
width=200,
height=200
).repeat(
row=product,
column=product[::-1]
)
Place preferences¶
This group of variables describes preferences about where and how customers make their purchases.
data[place].head()
| NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | |
|---|---|---|---|---|
| 0 | 3 | 8 | 10 | 4 |
| 1 | 2 | 1 | 1 | 2 |
| 2 | 1 | 8 | 2 | 10 |
| 3 | 2 | 2 | 0 | 4 |
| 4 | 5 | 5 | 3 | 6 |
We normalize this data by row.
data[place] = normalizer.fit_transform(data[place])
place_titles = []
place_charts = []
for i, col in enumerate(place):
    title = alt.Chart().mark_text(align='right', dx=5).encode(
        text=alt.value(col)
    )
    chart = alt.Chart(data[place]).mark_boxplot().encode(
        x=alt.X(col, title=None, scale=alt.Scale(type='sqrt'))
    )
    place_charts.append(chart)
    place_titles.append(title)
alt.hconcat(alt.vconcat(*place_titles),
            alt.vconcat(*place_charts,
                        spacing=0
                        ).resolve_scale(x='shared')
            )
Store Purchases are the most frequent, followed by Web purchases.
alt.Chart(data).mark_point().encode(
alt.X(alt.repeat("column"), type='quantitative'),
alt.Y(alt.repeat("row"), type='quantitative')
).properties(
width=200,
height=200
).repeat(
row=place,
column=place[::-1]
)
Promotion response¶
data[promotion].head()
| AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Response | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 |
We summarize this information in a 'promo' variable, storing the total number of campaign offers each customer accepted.
data['promo'] = data[promotion].sum(axis=1)
Correlation matrix¶
columns = product + place + ['product_sum', 'promo']
corr_mat = data[columns].corr().reset_index().melt(id_vars='index')
heat_map = alt.Chart(corr_mat, title='Correlation Matrix').mark_rect().encode(
alt.X('index:N'),
alt.Y('variable:N'),
alt.Color('value:Q', scale=alt.Scale(scheme='viridis')),
).properties(
width=750,
height=750
)
text = alt.Chart(corr_mat).mark_text(baseline='middle').encode(
alt.Text('value:Q', format=".2f"),
alt.X('index:N'),
alt.Y('variable:N')
)
heat_map + text
Summing up:
- the Wines amount is negatively correlated with the amounts of all the other product categories.
- product_sum (the total amount of spending) is strongly correlated with NumCatalogPurchases and negatively correlated with NumDealsPurchases.
- NumWebPurchases is negatively correlated with NumStorePurchases.
- MntGoldProds is negatively correlated with product_sum.
Preprocessing¶
Selecting variables for clustering¶
We want to conduct clustering based on a subgroup of variables that describe the spending behaviours of the customer.
columns = product + place + ['product_sum', 'promo']
X = data[columns]
Standardize features by removing the mean and scaling to unit variance.¶
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X)
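To see the effect of `StandardScaler`, a minimal sketch on a toy matrix (hypothetical values): after scaling, each column has mean 0 and unit variance, so large-scale features no longer dominate the distance computations used by clustering.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix standing in for X: two features on very different scales.
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])

toy_std = StandardScaler().fit_transform(toy)

# Each column is centered and scaled to unit variance.
print(toy_std.mean(axis=0))  # ~[0. 0.]
print(toy_std.std(axis=0))   # ~[1. 1.]
```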
Dimensional reduction¶
pca = PCA().fit(X_std)
ex_var_ratio = pd.DataFrame(enumerate(np.cumsum(pca.explained_variance_ratio_), start=1), columns=['Principal Component', 'Cumulative Explained Variance'])
baseplt = alt.Chart(ex_var_ratio).mark_line().encode(
x='Principal Component',
y='Cumulative Explained Variance',)
baseplt+alt.Chart(ex_var_ratio).mark_line(color='red').encode(
y=alt.datum(0.9),
x=alt.X('Principal Component:Q')
)
for i, x in enumerate(pca.explained_variance_ratio_.cumsum(), start=1):
    # i is 1-based, so it directly counts the components needed
    if x > 0.9:
        print(f'N components: {i}')
        break
X_reduced = PCA(n_components=i, random_state=11).fit_transform(X_std)
N components: 8
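The same component count can be derived without a loop via `np.argmax` on the cumulative ratio (note `argmax` returns a 0-based index, hence the +1); a sketch on synthetic data with a known variance structure, not the notebook's dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Five independent features with standard deviations 4, 2, 1, 0.1, 0.1:
# the first two directions carry almost all of the variance.
X_toy = rng.normal(size=(500, 5)) * np.array([4.0, 2.0, 1.0, 0.1, 0.1])

cum = PCA().fit(X_toy).explained_variance_ratio_.cumsum()
# First index where the cumulative ratio exceeds 90%, converted to a count.
n = int(np.argmax(cum > 0.9)) + 1
print(n)  # 2: the two dominant directions cover >90% of the variance
```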
Determine the number of clusters to be formed¶
We use the elbow method and the silhouette score to choose the best number of clusters. We perform the clustering with the K-Means algorithm for different values of k, then compare the results on charts.
sse = {}
sil = []
for k in range(2, 18):
kmeans = KMeans(n_clusters = k, random_state=29).fit(X_reduced)
sil.append(silhouette_score(X_reduced, kmeans.labels_, metric = 'euclidean'))
sse[k] = kmeans.inertia_
sil_df = pd.DataFrame(enumerate(sil, start=2), columns=['k', 'value']).assign(method='Silhouette score')
elb_df = pd.DataFrame(sse.items(), columns=['k', 'value']).assign(method='Elbow')
alt.Chart(pd.concat([sil_df, elb_df]).reset_index(drop=True)).mark_line().encode(
x='k',
y='value',
color='method'
).facet('method', columns=1
).resolve_axis(
y='independent'
).resolve_scale(
y='independent'
)
Looking at the charts, 3 clusters seems to be a reasonable choice.
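As a sanity check that the silhouette score can recover a known cluster count, a sketch on synthetic blobs (the three-center `make_blobs` data is an assumption for illustration, not the notebook's dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated clusters.
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# The silhouette score peaks at the true number of blobs.
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```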
Hierarchical Clustering¶
We create clusters of clients based on their spending behaviour, then use classification on demographic variables to predict each customer's cluster. Having run the full analysis, the best overall performance is obtained with 3 clusters: the resulting groups are well defined and distinct both in behaviour and in demographic characteristics.
hclust = AgglomerativeClustering(n_clusters = 3).fit_predict(X_reduced)
data['cluster'] = hclust
product_chart = alt.Chart(data).mark_boxplot().encode(
alt.Y(alt.repeat("column"), type='quantitative', scale=alt.Scale(type='sqrt')),
alt.X('cluster:N'),
alt.Color('cluster:N')
).repeat(
column=list(product)
).resolve_scale(
y='shared'
)
sum_product_chart = alt.Chart(data).mark_bar().encode(
alt.X('mean(product_sum):Q'),
alt.Y('cluster:N'),
alt.Color('cluster:N')
)
product_chart & sum_product_chart
Cluster 0: High Expenditure (preferences: Wine and Meat)
Cluster 1: Low Expenditure (preferences: Gold and Meat); this is the most balanced group
Cluster 2: Medium Expenditure (preferences: Wine and Meat)
promo_chart = alt.Chart(data).mark_bar().encode(
alt.X('mean(promo):Q'),
alt.Y('cluster:N'),
alt.Color('cluster:N')
)
promo_chart
The chart is similar to the mean product_sum chart, but the differences between groups are smaller, suggesting the following:
Group 0: low attention to promotions
Group 1: high attention to promotions
Group 2: medium attention to promotions
place_chart = alt.Chart(data).mark_boxplot().encode(
alt.Y(alt.repeat("column"), type='quantitative', scale=alt.Scale(type='sqrt')),
alt.X('cluster:N'),
alt.Color('cluster:N')
).repeat(
column=list(place)
).resolve_scale(
y='shared'
)
place_chart
Group 0: Low deals purchases and high Catalog Purchases
Group 1: High deals purchases
Group 2: Medium deals purchases
Note: Group 2 has a slightly higher percentage of Web Purchases
Analysis of demographics by clusters and classification¶
alt.Chart(data).mark_bar().encode(
alt.Y('cluster:N'),
alt.Color('cluster:N'),
alt.X('count()')
)
Categorical factors¶
The Education factor can be converted to an ordinal categorical type.
Education_order = ['Basic', '2n Cycle', 'Graduation', 'Master', 'PhD']
data['Education'] = pd.Categorical(data.Education, Education_order)
edu_chart = alt.Chart(data).mark_bar().encode(
alt.Y('Education',sort=Education_order),
alt.X('count()'),
alt.Row('cluster:N'),
alt.Color('cluster:N'),
tooltip = ['Education', 'count()']
)
# Change 'Alone' to 'Single' and drop the rows with 'YOLO' and 'Absurd' values.
data.loc[data.Marital_Status == 'Alone', 'Marital_Status'] = 'Single'
data = data[data.Marital_Status != 'Absurd']
data = data[data.Marital_Status != 'YOLO']
marital_chart = alt.Chart(data).mark_bar().encode(
alt.Y('Marital_Status'),
alt.X('count()'),
alt.Row('cluster:N'),
alt.Color('cluster:N'),
tooltip = ['Marital_Status', 'count()']
)
edu_chart | marital_chart
kid_chart = alt.Chart(data).mark_bar().encode(
alt.Y('Kidhome:O'),
alt.X('count()'),
alt.Row('cluster:N'),
alt.Color('cluster:N')
)
teen_chart = alt.Chart(data).mark_bar().encode(
alt.Y('Teenhome:O'),
alt.X('count()'),
alt.Row('cluster:N'),
alt.Color('cluster:N')
)
kid_chart | teen_chart
age_chart = alt.Chart(data).mark_bar().encode(
alt.Y('Year_Birth:Q', bin=alt.Bin(maxbins=6)),
alt.X('count()'),
alt.Row('cluster:N'),
alt.Color('cluster:N')
)
income_chart = alt.Chart(data).mark_bar().encode(
alt.Y('Income:Q', bin=alt.Bin(maxbins=10)),
alt.X('count()'),
alt.Row('cluster:N'),
alt.Color('cluster:N')
)
age_chart | income_chart
Encoding categorical factors¶
X = data[people].copy()
# Pass the explicit order: by default OrdinalEncoder sorts categories alphabetically.
enc_ord = OrdinalEncoder(categories=[Education_order])
X[['Education']] = enc_ord.fit_transform(X[['Education']])
enc_one=OneHotEncoder(drop='first', sparse_output=False)
df = pd.DataFrame(enc_one.fit_transform(X[['Marital_Status']]),
index=X.index,
columns=enc_one.get_feature_names_out())
X[df.columns] = df
X.drop('Marital_Status', axis=1, inplace=True)
y = data['cluster']
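A note on the ordinal encoding: by default `OrdinalEncoder` sorts categories alphabetically, which would rank '2n Cycle' below 'Basic'; passing the intended order explicitly preserves the education ladder. A minimal sketch on toy data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ['Basic', '2n Cycle', 'Graduation', 'Master', 'PhD']
toy = pd.DataFrame({'Education': ['PhD', 'Basic', '2n Cycle', 'Graduation']})

# Default: alphabetical order, so '2n Cycle' (code 0) ranks below 'Basic' (code 1).
default_codes = OrdinalEncoder().fit_transform(toy)
print(default_codes.ravel())  # [3. 1. 0. 2.]

# Explicit order: codes follow the intended education ladder.
ordered_codes = OrdinalEncoder(categories=[order]).fit_transform(toy)
print(ordered_codes.ravel())  # [4. 0. 1. 2.]
```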
Classification¶
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
parameters = dict(n_estimators=[200, 300],
criterion=['gini', 'entropy'],
max_depth=[None, 2, 5, 10],
min_samples_split=[2, 5, 10],
min_samples_leaf=[1, 2, 4])
rndf = RandomForestClassifier()
clf = GridSearchCV(rndf, parameters)
clf.fit(X_train, y_train)
print('Train: ', clf.best_score_, clf.best_params_)
Train: 0.7180622995877233 {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}
clf.score(X_test, y_test)
0.7023319615912208
ConfusionMatrixDisplay.from_estimator(
clf,
X_test,
y_test,
cmap=plt.cm.Blues
)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1899e69b950>
Conclusions¶
In this notebook, we conducted a comprehensive analysis of customer segmentation through clustering techniques and subsequently explored the relationship between these clusters and demographic variables. The clustering analysis successfully identified distinct customer groups, each characterized by unique purchasing behaviors and preferences.
Following the clustering, a classification analysis based on demographic features such as age, income, and education level revealed significant relationships between these variables and the identified clusters. This connection between demographic factors and customer behavior underscores the importance of personalized marketing strategies.
The results of this analysis can be used to enhance targeted marketing efforts, improve customer engagement, and ultimately drive business growth. By understanding the specific needs and preferences of different customer segments, companies can tailor their offerings and communications more effectively.